Finding Social Science Data for Your Research

Josh Quan

UC Berkeley Library

Fall 2017

An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question.

-John Tukey

Plan your Research with a Literature Review

http://www.lib.berkeley.edu/

Ask Yourself…

  • How feasible or doable is your research question?

  • Can you answer the question with a simple descriptive statistic (such as an average, median, or percentage)? If so, then your research question might be too narrow.

  • How many observations do you need?

  • Does the answer to your question have too many angles? If so, then your question might be too broad to answer in the time you have.

The structure and availability of data

  • Unit of Analysis: For which level do you want data? Summary or micro? (individuals, counties, nations)

  • Geography: Is there a geographic component to your topic? (U.S., Sub-Saharan Africa, India)

  • Time Period: Do you want data for a specific time period? (1980-2000, 1930-1960)

  • Frequency: How often do you want measures for your variables? (every year, every ten years, monthly, quarterly)

Data Providers

  • Researchers: Are there people you know who are doing this kind of research?

  • Government Agencies: Think about government agencies - is the request for some official statistics or data that they’d be likely to collect and publish? (industry, agriculture, construction, disease, crime)

  • NGOs: Are there councils or interest organizations devoted to the topic that might collect data independently? (HIV/AIDS, drugs, civil rights)

  • Research Organizations: Would any specific research organizations be interested in the topic? (Pew, Roper, Gallup, NORC, NBER, World Bank, OECD)

Library Research Guides

http://guides.lib.berkeley.edu/all-guides

Mind the 80/20 Rule

It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. -Dasu and Johnson, 2003

Web Scraping

http://statbel.fgov.be/en/statistics/figures/economy/indicators/prix_prod_con/

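library(rvest)  # provides read_html(), html_nodes(), html_text() and the %>% pipe
url <- 'http://statbel.fgov.be/en/statistics/figures/economy/indicators/prix_prod_con/'
page <- read_html(url)  # download and parse the page once
TAB <- page %>% html_nodes('td') %>% html_text()    # text of every table cell
NAMES <- page %>% html_nodes('th') %>% html_text()  # text of every header cell
M <- data.frame(matrix(TAB, ncol = 5, nrow = 9, byrow = TRUE))  # 9 rows x 5 value columns
M <- cbind(NAMES[7:15], M)  # header cells 7-15 become the first column (the years)
names(M) <- NAMES[1:6]      # header cells 1-6 become the column names
M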
##   Gross indices (2010=100)     I    II   III     IV  Year
## 1                     2008  99.9 101.2 101.0  102.3 101.1
## 2                     2009 101.0  99.7 100.5   98.9 100.0
## 3                     2010  99.4  99.8 100.0  100.8 100.0
## 4                     2011 102.9 103.2 104.5  105.1 103.9
## 5                     2012 105.7 106.1 106.0  105.6 105.9
## 6                     2013 105.4 105.4 106.7  107.1 106.1
## 7                     2014 107.3 107.2 107.4  107.6 107.4
## 8                     2015 108.6 108.8 109.3  109.5 109.1
## 9                     2016 110.3 110.7 110,8  111,3 110.8
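
The last row of the output above mixes decimal commas (110,8) with decimal points, a small instance of the cleaning work the 80/20 rule describes. Below is a minimal sketch of one way to handle it, assuming the scraped values arrive as text. The data frame `scraped`, the column name `period`, and the helper `to_number` are hypothetical names used only for illustration; the values are copied from the last two rows of the output above.

```r
# Hypothetical stand-in for the scraped table: the last two rows of the
# printed output, stored as text the way a scraper would return them.
scraped <- data.frame(
  period = c("2015", "2016"),
  I      = c("108.6", "110.3"),
  II     = c("108.8", "110.7"),
  III    = c("109.3", "110,8"),
  IV     = c("109.5", "111,3"),
  Year   = c("109.1", "110.8"),
  stringsAsFactors = FALSE
)

# Replace any decimal comma with a decimal point, then convert text to numbers.
to_number <- function(x) as.numeric(gsub(",", ".", x, fixed = TRUE))

# Apply the conversion to every column while keeping the data frame structure.
cleaned <- scraped
cleaned[] <- lapply(scraped, to_number)

str(cleaned)
```

Without this step, calling as.numeric() directly on a value such as "110,8" returns NA with a warning, so the problem is easy to miss until an average or a plot looks wrong.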

Text-mining

http://guides.lib.berkeley.edu/text-mining

D-Lab, Library Data Lab, Statistics Department

Questions?